Language Independent Named Entity Recognition in Indian Languages
Authors
Abstract
This paper reports on the development of a Named Entity Recognition (NER) system for South and South East Asian languages, particularly Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task (http://ltrc.iiit.ac.in/ner-ssea-08). We have used statistical Conditional Random Fields (CRFs). The system makes use of contextual information about the words along with a variety of features that are helpful in predicting the various named entity (NE) classes. The system uses both language independent and language dependent features. The language independent features are applicable to all the languages, whereas the language dependent features have been used for Bengali and Hindi only. One of the difficult tasks of the IJCNLP-08 NER Shared Task was to identify nested named entities (NEs) even though only the types of the maximal NEs were given. To identify nested NEs, we have used rules that are applicable to all five languages. In addition to these rules, gazetteer lists have been used for Bengali and Hindi. The system has been trained with Bengali (122,467 tokens), Hindi (502,974 tokens), Telugu (64,026 tokens), Oriya (93,173 tokens) and Urdu (35,447 tokens) data. The system has been tested with 30,505 tokens of Bengali, 38,708 tokens of Hindi, 6,356 tokens of Telugu, 24,640 tokens of Oriya and 3,782 tokens of Urdu. Evaluation results have demonstrated the highest maximal F-measure of 53.36%, nested F-measure of 53.46% and lexical F-measure of 59.39% for Bengali.
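As a rough illustration of what language independent contextual features for a CRF tagger can look like, the sketch below builds one feature dictionary per token from surface cues only. The abstract does not list the actual feature set, so the window size, the affix length, the feature names and the helper functions `token_features` and `sentence_features` are all assumptions made for this example, not the authors' configuration.

```python
def token_features(tokens, i, window=2, max_affix=3):
    """Build a feature dict for tokens[i] from surface cues only.

    Hypothetical feature set: window and max_affix are illustrative
    defaults, not values taken from the paper.
    """
    word = tokens[i]
    feats = {
        "word": word,
        "length": len(word),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Prefixes and suffixes, only when the word is longer than the affix.
    for n in range(1, max_affix + 1):
        if len(word) > n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    # Neighbouring words inside the context window; pad at sentence edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j] if inside else "<PAD>"
    return feats


def sentence_features(tokens):
    """Return one feature dict per token of a sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Romanized toy sentence, used only to show the output shape.
    for feats in sentence_features(["Sachin", "lives", "in", "Mumbai"]):
        print(feats)
```

None of these features relies on capitalization, part-of-speech tags or gazetteers, which is why they could in principle be applied unchanged to all five languages. Language dependent cues such as the gazetteer lists mentioned for Bengali and Hindi would be added as extra keys in the same dictionary.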
Similar resources
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and a hybrid of the two. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To get good performanc...
Named Entity Recognition for Indian Languages
This paper presents a new approach to recognizing named entities for Indian languages. A phonetic matching technique is used to match strings of different languages on the basis of their similar-sounding property. We have tested our system with a comparable corpus of English and Hindi language data. This approach is language independent and requires only a set of rules appropri...
A Two Stage Language Independent Named Entity Recognition for Indian Languages
This paper describes the development of a two stage hybrid Named Entity Recognition (NER) system for Indian languages, particularly Hindi, Oriya, Bengali and Telugu. We have used both a statistical Maximum Entropy Model (MaxEnt) and a Hidden Markov Model (HMM) in this system. We have used a variety of features and contextual information for predicting the various Named Entity (NE) classes. T...
A Neural Network-Based System for Identifying and Classifying Named Entities in Persian Text
Named Entity Recognition (NER) is a fundamental task in natural language processing and is also known as a subset of information extraction. We seek to locate and classify named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, etc. Named Entity Recognition for English texts has been researched widely in past years, howev...
Named Entity Recognition in Persian Text using Deep Learning
Named entity recognition is a fundamental task in the field of natural language processing. It is also known as a subset of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...
CRF-based Named Entity Recognition @ICON 2013
This paper describes the performance of CRF-based systems for Named Entity Recognition (NER) in Indian languages as part of the ICON 2013 shared task. In this task we have considered a set of language independent features for all the languages. Only for English has a language specific feature, i.e. capitalization, been added. Next, the use of gazetteers is explored for Bengali, Hindi and English. The ga...